In this document, exploratory data analysis is carried out with bioclimatic, soil, land cover and human footprint variables. Let us remember that the bioclimatic variables are consolidated from the years 1970 to 2000. The purpose of this document is to compare flowering dates between countries, validate relationships that have been reported in literature (for example, the relationship of temperature with DOY), determine the existence of interaction effects and establish which of the predictor variables could eventually be candidates to include in the modeling (selection of predictors). We could say that the variables analyzed in this document are “static” or “fixed”, since we do not have a temporality of them, they are consolidated data in pre-established time windows.
The following document (06-EDA2-FE.qmd) shows results of exploratory analysis for “dynamic” variables, I refer to them as dynamic since the climatological and photoperiod variables have temporality (data newspapers) and obtained information from January 1, 1981 to December 31, 2023.
In several of the results of that document USA-WDC is not included because it is only a coordinate (lat=38.88535, long=-77.03863) and the results of USA-NPN are not shown either, the reason is that the bioclimatic variables are not for the USA-NPN coordinates, however, as climate information was provided by NPN, I use this information for exploratory analysis at the end of this document and also in the next one (06-EDA2 -FE.qmd).
df_full |>filter(location !="NPN") |>ggplot(aes(x = bloom_doy)) +facet_wrap(~country, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by country",subtitle ="Original scale")df_full |>filter(location !="NPN") |>ggplot(aes(x = bloom_doy)) +facet_wrap(~country, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +scale_x_log10() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by country",subtitle ="Logarithmic scale")
Scatter plot of all vs DOY
In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the median of the bloom_doy (since this variable does change over time).
The following graphs allow us to show relationships (linear and non-linear) between some of the bioclimatic variables and the day of flowering. Associations are also observed between soil composition, for example, pH and DOY, the relationships are not meaningful or clear. relationships between the variables of land cover, human footprint, grasslands, shrubs, water and flowering date.
With these graphs it is validated or confirmed that the temperature (bio1, bio5, bio8, bio9, bio10 and bio11) and the flowering day exhibit a relationship that we could suggest is between linear and quadratic, independent of the country, at higher temperatures earlier flowering has been observed. This behavior is similar in Japan and Switzerland. However, South Korea presents some differences, for example, in the bio11 variable for Japan and Switzerland. , the quadratic type relationship is evident, but in South Korea it is not.
Precipitation (bio12 to bio19) also shows interesting relationships with flowering date, with slight differences between countries, for example, annual precipitation (bio12) shows an inverse linear relationship with DOY, In Switzerland, on the contrary, we observe a positive quadratic relationship and in South Korea the relationship is not clear. This result may suggest that temperature and precipitation are factors that interact and have a joint effect; Later, interactions between variables are explored to determine covariations of interest.
The seasonality of temperature (bio4) shows a different pattern of behavior between countries, in Japan the relationship is positive and linear, in South Korea the association is not clear and in Switzerland it shows an inverse quadratic relationship. Let us remember that the bioclimatic variables (bio_) collect information from the years 1970 to 2000, with a resolution of \(1 km^2\), this finding will be compared later with the daily temperature information to validate the relationships that are evident in this graphic.
Regarding soil composition, nitrogen, soil water pH, soil organic carbon content (SOC), bulk density (BDOD) and cation exchange capacity (CEC ) show trends in the relationship with DOY. The other variables appear to have no association (linear or non-linear) with the target variable.
Finally, the information on cropland (CROPLAND), human footprint (FOOTPRINT) and land cover do not exhibit any type of relationship that could be considered of interest for subsequent analyses. Furthermore, some of these variables (for example example SHRUBS) lack information for the coordinates under analysis.
The trend line (red color) in these graphs was obtained with a generalized additive model (GAM) using natural splines with two degrees of freedom.
These correlation profiles show for each country and all countries, from highest to lowest correlation (Spearman). In Japan and South Korea, cherry trees located further north seem to take longer to flower (blue colors), the variable latitude is the first in the correlation profile of these countries, however, in Switzerland it is not the latitude The linear factor with the greatest influence is altitude followed by soil composition variables and in the top 10 are three variables that collect precipitation information. If we look at the lower part of the graph (red colors), in all countries the annual temperature variable (bio1) coincides in being the inverse linear factor of greatest magnitude, followed by other indicator variables also of temperatures. This graph is very useful not only to represent linear relationships but also to obtain a quick characterization of the relationship between climate and phenology in cherry trees located in different countries. We can say that if we think about the factors that are associated with ** lateness** in flowering, the profiles between countries are very different, however, if we think about the factors that are associated with earliness in flowering, the profiles between countries are very similar. In conclusion, flowering faster seems to have common and global “causes” linked to temperature, however, the delay in flowering could be more difficult to establish and generalize.
With the following graphs I intend to recognize if there are interaction behaviors, that is, I try to confirm from visual representations if there are joint effects between the predictor variables. Given that the set of predictor variables is “large” \((P = 41)\), generating all the interactions (double, triple or higher hierarchy) could be costly and time-consuming. The following are the two approaches that I implement:
1. Initially I am going to plot scatter plots with the variables that showed trends in the scatter plots and correlation profiles; the color of the dots is determined by the response variable bloom_doy.
2. I graph the double interaction for some variables of interest vs the response variable.
Scatter plot
In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the median of the bloom_doy (since this variable does change over time).
The trend line (red color) in these graphs was obtained with a generalized additive model (GAM) using natural splines with two degrees of freedom.
scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")
Code
scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")
Code
scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)") scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)")
Code
scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")
In this case, since the variables are numerical, I simply multiply the two variables under analysis to evaluate the interaction effect. Taking into account that bloom_doy is the response variable \((y)\) and if we exemplify a simple model that includes the variables annual temperature \((x_1)\) and precipitation \((x_2)\) as predictors, my intention is to validate whether the effect cumulative or joint is present. We can estimate the marginal effect of \(x_1\) and define it as \(\beta_1\), we can do the same with the marginal effect of \(x_2\) and define it as \(\beta_2\), however, it is also of interest to evaluate the interaction effect $ (_3)x_1x_2$, in that order of ideas the mathematical model can be expressed as shown below:
A first interesting result of the interactions evaluated is that the behavior differs between countries, for example, the interaction temperature-precipitation shows an inverse relationship with bloom_doy, but it is not linear in the three countries, in South Korea the relationship tends to be quadratic. We can observe something similar in the temperature-nitrogen interaction, in Japan and South Korea it does not exhibit any relationship with the target variable, however, in Switzerland an inverse linear relationship is evident. In the three countries the interaction temperature-bulk density is negatively related to the response variable, but in Japan the linear dependence that exists between the variable \(y\) and the predictor derived from this interaction is evident.
Later (in document 06-EDA2-FE.qmd) I make use of lasso regression to select predictors and evaluate possible interaction effects that were not represented in this exploratory analysis.
The trend line (red color) in these graphs was obtained with a generalized additive model (GAM) using natural splines with two degrees of freedom.
interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")
Code
interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")
Code
interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")
Code
interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")
In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the median of the bloom_doy (since this variable does change over time).
Taking advantage of the fact that there is a high correlation between the predictor variables, I run principal components analysis to reduce the dimensionality and try to see if in a smaller space of variables a behavioral pattern related to the response variable is exhibited.
To facilitate visualizations I discretize the response variable into the following levels (ordinal):
Q1:bloom_doy values lower than the value of quartile 1: \(DOY < Q1\).
Q2: values of bloom_doy greater than or equal to the value of quartile 1 and less than the value of quartile 2: \(Q1 \leq DOY < Q2\).
Q3: values of bloom_doy greater than or equal to the value of quartile 2 and less than the value of quartile 3: \(Q2 \leq DOY < Q3\).
Q4:bloom_doy values greater than or equal to the value of quartile 3: \(DOY \geq Q3\).
I introduce the discretized response variable doy_categ to the PCA as a qualitative supplementary variable.
Regardless of the country, the blooms that do not exceed Q1 (DOY < Q1) and those that take longer than Q3 (DOY >= Q3) exhibit differences when graphed in the first three main components, being located in different places . It is important to mention that the pattern is less visible in South Korea. Blooms that are between Q1 and Q3 tend to be similar. This differentiation seen between early flowering and late flowering could be indicative of differential profiles in the ecological niche or environmental environment to which cherry trees are exposed.
In the case of Japan, it is observed that component 1 is positively associated with variables that collect information on temperature (bio1, bio5, bio10, bio11) and precipitation (bio12, bio13, bio16), indicating that values greater than zero in We can associate component 1 with high values of temperature and precipitation, precisely under these conditions we observe rapid flowering (blue dots), on the other hand, this component is negatively associated with latitude, longitude, bio4 (seasonality of the temperature), bio7 (annual temperature range) and to a lesser extent with nitrogen, indicating that values greater than zero of this component will be related not only to high values of temperature and precipitation but also to less variation in temperature (bio4) and lower annual temperature range (bio7), therefore, the described profile is the one that applies to early flowering, just the opposite occurs with late flowering, where low temperatures and precipitation are expected, with greater variations in temperature throughout the year , we could also affirm that plants that are located further south in Japan tend to have delayed flowering. Due to the location of the points we could also affirm that early flowering in Japan is modulated by changes in temperature, however, other factors such as cation exchange capacity (cec) and organic carbon density (ocd) could be influential, since it is observed that these variables have a positive correlation with component 2, indicating that values greater than zero in component 2 are associated with soils with high cation exchange capacity and soils with greater density of organic carbon.
In Switzerland, the difference between early flowering and late flowering is also evident; however, the correlation of the components with the variables allows us to infer behavioral patterns that are slightly different from those in Japan and South Korea. Component one has a positive correlation with variables that collect temperature information (bio4, bio5, bio10, bio11) and exhibits a negative correlation with altitude, soil organic carbon content (soc), soil organic carbon inventory (ocs) and to a lesser extent with precipitation variables (bio12, bio16, bio17), these correlations allow us to understand that early blooms occur in conditions of high temperature (also the annual variation) and in places where there is less rainfall annual, lower soil organic carbon content and located at low altitudes. Cherry trees that exhibit late blooms tend to be located at high altitudes, sites with greater amounts of annual precipitation, higher soil organic carbon contents.
In South Korea, a behavior similar to that of Japan and Switzerland is observed in that temperature is the determining factor in flowering times. However, component 1 is negatively associated with variables that collect temperature information (bio1, bio6, bio9, bio11) and positively with variables that collect information on the annual variation in temperature (bio4, bio7). Late blooms are located to the right of 0 in component 1, that is, these plants are expected to be in places with greater variations in temperature throughout the year and the temperatures of these sites are also expected to be lower. Based on the positive association that component 3 has with precipitation (bio12, bio15, bio16) and observing the location of the green dots in the graph of CP1 vs CP3, it is possible to intuit that delayed flowering is also associated with greater precipitation.
It is important to mention that I make all these interpretations from an exploratory perspective, following the premises of J. W. Tukey as shown in the following diagram. The main idea has been to recognize patterns that allow me to make an objective diagnosis for the modeling phase.
With four principal components, 69.4% of the total variability is retained.
With ten principal components, 89.2% of the total variability is retained.
Decreasing from 41 dimensions to 4 with about 30% of the variability is not bad for graphical purposes, however, for modeling purposes it would be worth reducing from 41 variables to just 10 components.
At this point I am going to perform exploratory analysis with the data provided by USA National Phenology Network. I import the previously integrated files (df_npn_usa.parquet).
In the previous graph, high dispersion is observed in the response variable bloom_doy, for that reason in the following graph I compare the distribution by species.
Code
data_npn |>ggplot(aes(x = bloom_doy)) +facet_wrap(~Species, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by specie",subtitle ="Original scale")data_npn |>ggplot(aes(x = bloom_doy)) +facet_wrap(~Species, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +scale_y_log10() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by specie",subtitle ="Logarithmic scale")
DOY distribution by species and state
In the previous graph, high dispersion is still observed in the response variable bloom_doy, for that reason in the following graph I compare the distribution by species and State.
I filter states with 5 or more records to have a minimum of information and to represent the distribution.
---title: "International Cherry Blossom Prediction Competition"subtitle: "Exploratory Data Analysis 1"author: "Edimer David Jaramillo"date: "`r lubridate::now()`"lang: en-USformat: html: page-layout: article toc: true code-fold: true df-print: paged toc-location: left number-depth: 4 theme: yeti code-copy: true highlight-style: github embed-resources: true code-tools: source: true ---```{r}#| label: setup#| include: falseknitr::opts_chunk$set(echo =TRUE,warning =FALSE,error =FALSE, message =FALSE,fig.align ='center')```# Document description- In this document, exploratory data analysis is carried out with bioclimatic, soil, land cover and human footprint variables. Let us remember that the bioclimatic variables are consolidated from the years 1970 to 2000. The purpose of this document is to compare flowering dates between countries, validate relationships that have been reported in literature (for example, the relationship of temperature with *DOY*), determine the existence of interaction effects and establish which of the predictor variables could eventually be candidates to include in the modeling (*selection of predictors*). We could say that the variables analyzed in this document are *"static"* or *"fixed"*, since we do not have a temporality of them, they are consolidated data in pre-established time windows.- The following document (**06-EDA2-FE.qmd**) shows results of exploratory analysis for *"dynamic"* variables, I refer to them as dynamic since the climatological and photoperiod variables have temporality (data newspapers) and obtained information from January 1, 1981 to December 31, 2023.# Libraries and setup```{r}# Librarieslibrary(tidyverse)library(arrow)library(splines)library(corrr)library(scales)library(FactoMineR)library(factoextra)library(glue)# Theme ggplot2theme_set(theme_bw() +theme(legend.position ="top"))# Functionsfs::dir_ls("../source/r/module-EDA/") |>walk(.f = source)# Inputscountries <-c("Japan", "Switzerland", "South Korea")```# Data- In several of the results of that document `USA-WDC` is not included because it is only a coordinate (lat=38.88535, long=-77.03863) and the results of `USA-NPN` are not shown either, the reason is that the bioclimatic variables are not for the `USA-NPN` coordinates, however, as climate information was provided by *NPN*, I use this information for exploratory analysis at the end of this document and also in the next one (**06-EDA2 -FE.qmd**).```{r}df_full <-read_parquet("../external-data/df_full.parquet")df_full |>head()```# DOY distribution by country```{r}#| fig-width: 7#| fig-height: 5.5#| layout-nrow: 1#| column: screen-inset-shadeddf_full |>filter(location !="NPN") |>ggplot(aes(x = bloom_doy)) +facet_wrap(~country, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by country",subtitle ="Original scale")df_full |>filter(location !="NPN") |>ggplot(aes(x = bloom_doy)) +facet_wrap(~country, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +scale_x_log10() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by country",subtitle ="Logarithmic scale")```# Scatter plot of all vs DOY- In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the **median of the `bloom_doy`** (since this variable does change over time).- The following graphs allow us to show relationships (linear and non-linear) between some of the bioclimatic variables and the day of flowering. Associations are also observed between soil composition, for example, pH and *DOY*, the relationships are not meaningful or clear. relationships between the variables of land cover, human footprint, grasslands, shrubs, water and flowering date.- With these graphs it is validated or confirmed that the temperature (*bio1*, *bio5*, *bio8*, *bio9*, *bio10* and *bio11*) and the flowering day exhibit a relationship that we could suggest is between linear and quadratic, independent of the country, at higher temperatures earlier flowering has been observed. This behavior is similar in Japan and Switzerland. However, South Korea presents some differences, for example, in the *bio11* variable for Japan and Switzerland. , the quadratic type relationship is evident, but in South Korea it is not.- Precipitation (*bio12* to *bio19*) also shows interesting relationships with flowering date, with slight differences between countries, for example, annual precipitation (*bio12*) shows an inverse linear relationship with *DOY*, In Switzerland, on the contrary, we observe a positive quadratic relationship and in South Korea the relationship is not clear. This result may suggest that temperature and precipitation are factors that interact and have a joint effect; Later, interactions between variables are explored to determine covariations of interest.- The seasonality of temperature (*bio4*) shows a different pattern of behavior between countries, in Japan the relationship is positive and linear, in South Korea the association is not clear and in Switzerland it shows an inverse quadratic relationship. Let us remember that the bioclimatic variables (*bio_*) collect information from the years 1970 to 2000, with a resolution of $1 km^2$, this finding will be compared later with the daily temperature information to validate the relationships that are evident in this graphic.- Regarding soil composition, nitrogen, soil water pH, soil organic carbon content (*SOC*), bulk density (*BDOD*) and cation exchange capacity (*CEC* ) show trends in the relationship with *DOY*. The other variables appear to have no association (linear or non-linear) with the target variable.- Finally, the information on cropland (*CROPLAND*), human footprint (*FOOTPRINT*) and land cover do not exhibit any type of relationship that could be considered of interest for subsequent analyses. Furthermore, some of these variables (for example example *SHRUBS*) lack information for the coordinates under analysis.- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}## `r countries[1]````{r}#| fig-width: 9#| fig-height: 16#| fig-align: center#| column: page-inset-rightscatterAllDOY(data = df_full,country_sel = countries[1]) ```## `r countries[2]````{r}#| fig-width: 9#| fig-height: 16#| fig-align: center#| column: page-inset-rightscatterAllDOY(data = df_full,country_sel = countries[2]) ```## `r countries[3]````{r}#| fig-width: 9#| fig-height: 16#| fig-align: center#| column: page-inset-rightscatterAllDOY(data = df_full,country_sel = countries[3]) df_full |>filter(!country %in% countries)```## All the countries```{r}#| fig-width: 9#| fig-height: 16#| fig-align: center#| column: page-inset-rightscatterAllDOY(data = df_full,country_sel =NULL) ```:::# Non-parametric correlations- These correlation profiles show for each country and all countries, from highest to lowest correlation (Spearman). In Japan and South Korea, cherry trees located further north seem to take longer to flower (blue colors), the variable *latitude* is the first in the correlation profile of these countries, however, in Switzerland it is not the latitude The linear factor with the greatest influence is *altitude* followed by soil composition variables and in the top 10 are three variables that collect precipitation information. If we look at the lower part of the graph (red colors), in all countries the annual temperature variable (*bio1*) coincides in being the inverse linear factor of greatest magnitude, followed by other indicator variables also of temperatures. This graph is very useful not only to represent linear relationships but also to obtain a quick characterization of the relationship between climate and phenology in cherry trees located in different countries. We can say that if we think about the factors that are associated with ** lateness** in flowering, the profiles between countries are very different, however, if we think about the factors that are associated with **earliness** in flowering, the profiles between countries are very similar. In conclusion, flowering faster seems to have common and global *"causes"* linked to temperature, however, the delay in flowering could be more difficult to establish and generalize.```{r}#| fig-width: 4.5#| fig-height: 10#| column: screen#| layout-nrow: 1profileCorr(data = df_full, country_sel = countries[1])profileCorr(data = df_full, country_sel = countries[2])profileCorr(data = df_full, country_sel = countries[3])profileCorr(data = df_full, country_sel =NULL)```# Interaction of variables- With the following graphs I intend to recognize if there are interaction behaviors, that is, I try to confirm from visual representations if there are joint effects between the predictor variables. Given that the set of predictor variables is "large" $(P = 41)$, generating all the interactions (double, triple or higher hierarchy) could be costly and time-consuming. The following are the two approaches that I implement: - **1.** Initially I am going to plot scatter plots with the variables that showed trends in the scatter plots and correlation profiles; the color of the dots is determined by the response variable `bloom_doy`. - **2.** I graph the double interaction for some variables of interest vs the response variable.## Scatter plot- In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the **median of the `bloom_doy`** (since this variable does change over time).- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### Temperature - Precipitation```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Annual precipitation (mm)")```### Temperature - Nitrogen```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Nitrogen (cg/kg)")```### Temperature - SOC```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)") scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Soil organic carbon (dg/kg)")```### Temperature - BDOD```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")scatter2DTarget(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C)",label_y ="Bulk density (cg/cm³)")```### Precipitation - Nitrogen```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual precipitation (mm)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual precipitation (mm)",label_y ="Nitrogen (cg/kg)")scatter2DTarget(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual precipitation (mm)",label_y ="Nitrogen (cg/kg)")```### Altitude - Precipitation```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1scatter2DTarget(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[1],label_x ="Altitude",label_y ="Annual precipitation (mm)") +scale_x_log10()scatter2DTarget(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[2],label_x ="Altitude",label_y ="Annual precipitation (mm)") +scale_x_log10()scatter2DTarget(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[3],label_x ="Altitude",label_y ="Annual precipitation (mm)") +scale_x_log10()```:::## InteractionIn this case, since the variables are numerical, I simply multiply the two variables under analysis to evaluate the interaction effect. Taking into account that *bloom_doy* is the response variable $(y)$ and if we exemplify a simple model that includes the variables annual temperature $(x_1)$ and precipitation $(x_2)$ as predictors, my intention is to validate whether the effect cumulative or joint is present. We can estimate the marginal effect of $x_1$ and define it as $\beta_1$, we can do the same with the marginal effect of $x_2$ and define it as $\beta_2$, however, it is also of interest to evaluate the interaction effect $ (\beta_3)x_1x_2$, in that order of ideas the mathematical model can be expressed as shown below:$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + error$$A first interesting result of the interactions evaluated is that the behavior differs between countries, for example, the interaction *temperature-precipitation* shows an inverse relationship with *bloom_doy*, but it is not linear in the three countries, in South Korea the relationship tends to be quadratic. We can observe something similar in the *temperature-nitrogen* interaction, in Japan and South Korea it does not exhibit any relationship with the target variable, however, in Switzerland an inverse linear relationship is evident. In the three countries the interaction *temperature-bulk density* is negatively related to the response variable, but in Japan the linear dependence that exists between the variable $y$ and the predictor derived from this interaction is evident.Later (in document **06-EDA2-FE.qmd**) I make use of *lasso regression* to select predictors and evaluate possible interaction effects that were not represented in this exploratory analysis.- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### Temperature - Precipitation```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="bio1",var_y ="bio12",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Annual precipitation (mm)")```### Temperature - Nitrogen```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Nitrogen (cg/kg)")```### Temperature - SOC```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")interactPlot(data = df_full,var_x ="bio1",var_y ="soc_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Soil organic carbon (dg/kg)")```### Temperature - BDOD```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[1],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[2],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")interactPlot(data = df_full,var_x ="bio1",var_y ="bdod_0-5cm_mean",country_sel = countries[3],label_x ="Annual mean temperature (°C) * Bulk density (cg/cm³)")```### Precipitation - Nitrogen```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[1],label_x ="Annual precipitation (mm) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[2],label_x ="Annual precipitation (mm) * Nitrogen (cg/kg)")interactPlot(data = df_full,var_x ="bio12",var_y ="nitrogen_0-5cm_mean",country_sel = countries[3],label_x ="Annual precipitation (mm) * Nitrogen (cg/kg)")```### Altitude - Precipitation```{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1interactPlot(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[1],label_x ="Altitude * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[2],label_x ="Altitude * Annual precipitation (mm)")interactPlot(data = df_full,var_x ="alt",var_y ="bio12",country_sel = countries[3],label_x ="Altitude * Annual precipitation (mm)")```::: # Principal Component Analysis (PCA)- In these graphs I apply the distinct() function for latitude and longitude because the same coordinate (even if they are in different years) has the same bioclimatic, soil and cover data, since this information does not have temporality. For that reason I summarize with the **median of the `bloom_doy`** (since this variable does change over time).- Taking advantage of the fact that there is a high correlation between the predictor variables, I run principal components analysis to reduce the dimensionality and try to see if in a smaller space of variables a behavioral pattern related to the response variable is exhibited.- To facilitate visualizations I discretize the response variable into the following levels (ordinal): - **`Q1`:** `bloom_doy` values lower than the value of quartile 1: $DOY < Q1$. - **`Q2`:** values of `bloom_doy` greater than or equal to the value of quartile 1 and less than the value of quartile 2: $Q1 \leq DOY < Q2$. - **`Q3`:** values of `bloom_doy` greater than or equal to the value of quartile 2 and less than the value of quartile 3: $Q2 \leq DOY < Q3$. - **`Q4`:** `bloom_doy` values greater than or equal to the value of quartile 3: $DOY \geq Q3$.- I introduce the discretized response variable `doy_categ` to the PCA as a qualitative supplementary variable.- Regardless of the country, the blooms that do not exceed Q1 (`DOY < Q1`) and those that take longer than Q3 (`DOY >= Q3`) exhibit differences when graphed in the first three main components, being located in different places . It is important to mention that the pattern is less visible in South Korea. Blooms that are between Q1 and Q3 tend to be similar. This differentiation seen between early flowering and late flowering could be indicative of differential profiles in the ecological niche or environmental environment to which cherry trees are exposed.- In the case of Japan, it is observed that component 1 is positively associated with variables that collect information on temperature (bio1, bio5, bio10, bio11) and precipitation (bio12, bio13, bio16), indicating that values greater than zero in We can associate component 1 with high values of temperature and precipitation, precisely under these conditions we observe rapid flowering (blue dots), on the other hand, this component is negatively associated with latitude, longitude, bio4 (seasonality of the temperature), bio7 (annual temperature range) and to a lesser extent with nitrogen, indicating that values greater than zero of this component will be related not only to high values of temperature and precipitation but also to less variation in temperature (bio4) and lower annual temperature range (bio7), therefore, the described profile is the one that applies to early flowering, just the opposite occurs with late flowering, where low temperatures and precipitation are expected, with greater variations in temperature throughout the year , we could also affirm that plants that are located further south in Japan tend to have delayed flowering. Due to the location of the points we could also affirm that early flowering in Japan is modulated by changes in temperature, however, other factors such as cation exchange capacity (*cec*) and organic carbon density (*ocd*) could be influential, since it is observed that these variables have a positive correlation with component 2, indicating that values greater than zero in component 2 are associated with soils with high cation exchange capacity and soils with greater density of organic carbon.- In Switzerland, the difference between early flowering and late flowering is also evident; however, the correlation of the components with the variables allows us to infer behavioral patterns that are slightly different from those in Japan and South Korea. Component one has a positive correlation with variables that collect temperature information (bio4, bio5, bio10, bio11) and exhibits a negative correlation with altitude, soil organic carbon content (*soc*), soil organic carbon inventory (*ocs*) and to a lesser extent with precipitation variables (bio12, bio16, bio17), these correlations allow us to understand that early blooms occur in conditions of high temperature (also the annual variation) and in places where there is less rainfall annual, lower soil organic carbon content and located at low altitudes. Cherry trees that exhibit late blooms tend to be located at high altitudes, sites with greater amounts of annual precipitation, higher soil organic carbon contents.- In South Korea, a behavior similar to that of Japan and Switzerland is observed in that temperature is the determining factor in flowering times. However, component 1 is negatively associated with variables that collect temperature information (bio1, bio6, bio9, bio11) and positively with variables that collect information on the annual variation in temperature (bio4, bio7). Late blooms are located to the right of 0 in component 1, that is, these plants are expected to be in places with greater variations in temperature throughout the year and the temperatures of these sites are also expected to be lower. Based on the positive association that component 3 has with precipitation (bio12, bio15, bio16) and observing the location of the green dots in the graph of CP1 vs CP3, it is possible to intuit that delayed flowering is also associated with greater precipitation.- It is important to mention that I make all these interpretations from an exploratory perspective, following the premises of [J. W. Tukey](https://web.stanford.edu/class/bios221/book/00-chap.html) as shown in the following diagram. The main idea has been to recognize patterns that allow me to make an objective diagnosis for the modeling phase. <center>{width=250}</center>::: {.panel-tabset}## Japan- With four principal components, 69.4% of the total variability is retained.- With ten principal components, 89.2% of the total variability is retained.- Decreasing from 41 dimensions to 4 with about 30% of the variability is not bad for graphical purposes, however, for modeling purposes it would be worth reducing from 41 variables to just 10 components.```{r}#| fig-width: 6.5#| fig-height: 4.8#| column: screen#| layout-nrow: 1pca_japan <-customPCA(data = df_full, country_sel = countries[1])pca_japan$res1pca_japan$res2pca_japan$res3```## Switzerland- With four principal components, 66.4% of the total variability is retained.- With ten principal components, 85.5% of the total variability is retained.```{r}#| fig-width: 6.5#| fig-height: 4.8#| column: screen#| layout-nrow: 1pca_switzerland <-customPCA(data = df_full, country_sel = countries[2])pca_switzerland$res1pca_switzerland$res2pca_switzerland$res3```## South Korea- With four principal components, 66.4% of the total variability is retained.- With ten principal components, 89.7% of the total variability is retained.```{r}#| fig-width: 6.5#| fig-height: 4.8#| column: screen#| layout-nrow: 1pca_southk <-customPCA(data = df_full, country_sel = countries[3])pca_southk$res1pca_southk$res2pca_southk$res3```:::# USA National Phenology Network- At this point I am going to perform exploratory analysis with the data provided by *USA National Phenology Network*. I import the previously integrated files (*df_npn_usa.parquet*).```{r}data_npn_fixed <- df_full |>filter(country =="USA-NPN") |>select(-contains("bio"),-c(shrubs, location, country, year, bloom_date, bloom_doy, alt)) |>distinct(lat, long, .keep_all =TRUE)data_npn <-read_parquet("../external-data/df_npn_usa.parquet")list_species <-unique(data_npn$Species)data_npn |>head()```## Distribution of variables (external data)- **Note:** as these variables are *fixed*, for each coordinate different records were filtered for latitude and longitude.```{r}#| fig-width: 9#| fig-height: 8#| column: page-inset-rightdata_npn_fixed |>pivot_longer(cols =where(is.numeric)) |>ggplot(aes(x = value)) +facet_wrap(~name, scales ="free", ncol =4) +geom_histogram(color ="black")```## Distribution of variables (internal data)- For this graph I filter the data that is greater than -100, due to the existence of data with negative values (-9999).- These variables are dynamic and change over time, unlike the previous ones which are fixed.```{r}#| fig-width: 9#| fig-height: 8#| column: page-inset-rightdata_npn |>select(!all_of(names(data_npn_fixed)))|>select(-c(Site_ID, Individual_ID)) |>pivot_longer(cols =where(is.numeric)) |>filter(value >-100) |>ggplot(aes(x = value)) +facet_wrap(~name, scales ="free", ncol =4) +geom_histogram(color ="black")```## DOY distribution by species- In the previous graph, high dispersion is observed in the response variable *bloom_doy*, for that reason in the following graph I compare the distribution by species.```{r}#| fig-width: 4#| fig-height: 6.5#| layout-nrow: 1#| column: screen-inset-shadeddata_npn |>ggplot(aes(x = bloom_doy)) +facet_wrap(~Species, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by specie",subtitle ="Original scale")data_npn |>ggplot(aes(x = bloom_doy)) +facet_wrap(~Species, ncol =1, scales ="free_y") +geom_histogram(color ="black") +geom_rug() +scale_y_log10() +labs(x ="Bloom DOY", y ="Count",title ="DOY distribution by specie",subtitle ="Logarithmic scale")```## DOY distribution by species and state- In the previous graph, high dispersion is still observed in the response variable *bloom_doy*, for that reason in the following graph I compare the distribution by species and `State`.- I filter states with 5 or more records to have a minimum of information and to represent the distribution.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 5.8#| fig-height: 4.7#| column: page-inset-rightdistSpecieStateNPN(data = data_npn,specie = list_species[1])```### `r list_species[2]````{r}#| fig-width: 5.8#| fig-height: 4.7#| column: page-inset-rightdistSpecieStateNPN(data = data_npn,specie = list_species[2])```### `r list_species[3]````{r}#| fig-width: 5.8#| fig-height: 4.7#| column: page-inset-rightdistSpecieStateNPN(data = data_npn,specie = list_species[3])```### `r list_species[4]````{r}#| fig-width: 5.8#| fig-height: 4.7#| column: page-inset-rightdistSpecieStateNPN(data = data_npn,specie = list_species[4])```### `r list_species[5]````{r}#| fig-width: 5.8#| fig-height: 4.7#| column: page-inset-rightdistSpecieStateNPN(data = data_npn,specie = list_species[5])```:::## Temperature vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 7.5#| fig-height: 3.8 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Temperature",specie = list_species[1])```### `r list_species[2]````{r}#| fig-width: 7.5#| fig-height: 3.8 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Temperature",specie = list_species[2])```### `r list_species[3]````{r}#| fig-width: 7.5#| fig-height: 3.8 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Temperature",specie = list_species[3])```### `r list_species[4]````{r}#| fig-width: 7.5#| fig-height: 3.8 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Temperature",specie = list_species[4]) ```### `r list_species[5]````{r}#| fig-width: 7.5#| fig-height: 3.8 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Temperature",specie = list_species[5]) ```::: ## Precipitation vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 7.5#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Precipitation",specie = list_species[1]) ```### `r list_species[2]````{r}#| fig-width: 7.5#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Precipitation",specie = list_species[2]) ```### `r list_species[3]````{r}#| fig-width: 7.5#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Precipitation",specie = list_species[3]) ```### `r list_species[4]````{r}#| fig-width: 7.5#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Precipitation",specie = list_species[4]) ```### `r list_species[5]````{r}#| fig-width: 7.5#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Precipitation",specie = list_species[5]) ```::: ## Others vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 9#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Others",specie = list_species[1]) ```### `r list_species[2]````{r}#| fig-width: 9#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Others",specie = list_species[2])```### `r list_species[3]````{r}#| fig-width: 9#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Others",specie = list_species[3])```### `r list_species[4]````{r}#| fig-width: 9#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Others",specie = list_species[4])```### `r list_species[5]````{r}#| fig-width: 9#| fig-height: 2.5 #| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Others",specie = list_species[5])```::: ## Fixed variables vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with two degrees of freedom.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 8#| fig-height: 6.5#| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Fixed",specie = list_species[1])```### `r list_species[2]````{r}#| fig-width: 8#| fig-height: 6.5#| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Fixed",specie = list_species[2]) ```### `r list_species[3]````{r}#| fig-width: 8#| fig-height: 6.5#| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Fixed",specie = list_species[3]) ```### `r list_species[4]````{r}#| fig-width: 8#| fig-height: 6.5#| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Fixed",specie = list_species[4])```### `r list_species[5]````{r}#| fig-width: 8#| fig-height: 6.5#| column: page-inset-right scatterSpeciesNPN(data = data_npn,type ="Fixed",specie = list_species[5]) ```::: ## AGDD and precipitation vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with three degrees of freedom.::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[1],var_y ="Accum_Prcp",lab_title ="accumulated precipitation" )res_scatter$res1res_scatter$res2```### `r list_species[2]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[2],var_y ="Accum_Prcp",lab_title ="accumulated precipitation" )res_scatter$res1res_scatter$res2```### `r list_species[3]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[3],var_y ="Accum_Prcp",lab_title ="accumulated precipitation" )res_scatter$res1res_scatter$res2```### `r list_species[4]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[4],var_y ="Accum_Prcp",lab_title ="accumulated precipitation" )res_scatter$res1res_scatter$res2```### `r list_species[5]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[5],var_y ="Accum_Prcp",lab_title ="accumulated precipitation" )res_scatter$res1res_scatter$res2```:::## AGDD and Day length vs DOY- The trend line (red color) in these graphs was obtained with a generalized additive model ([*GAM*](https://en.wikipedia.org/wiki/Generalized_additive_model)) using [*natural splines*](https://en.wikipedia.org/wiki/Spline_(mathematics)) with three degrees of freedom. ::: {.panel-tabset}### `r list_species[1]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[1],var_y ="Daylength",lab_title ="day length" )res_scatter$res1res_scatter$res2```### `r list_species[2]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[2],var_y ="Daylength",lab_title ="day length" )res_scatter$res1res_scatter$res2```### `r list_species[3]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[3],var_y ="Daylength",lab_title ="day length" )res_scatter$res1res_scatter$res2```### `r list_species[4]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[4],var_y ="Daylength",lab_title ="day length" )res_scatter$res1res_scatter$res2```### `r list_species[5]````{r}#| fig-width: 6#| fig-height: 4.5#| column: screen#| layout-nrow: 1res_scatter <-scatterAGDD(data = data_npn,specie = list_species[5],var_y ="Daylength",lab_title ="day length" )res_scatter$res1res_scatter$res2```:::